NSF PAR Search | NSF Public Access Repository

In-memory Incremental Maintenance of Provenance Sketches

https://doi.org/10.48786/edbt.2026.05

Li, Pengyuan; Glavic, Boris; Gawlick, Dieter; Krishnaswamy, Vasudha; Liu, Zhen Hua; Porobic, Danica; Niu, Xing (January 2026, OpenProceedings.org)
Provenance-based data skipping

https://doi.org/10.14778/3494124.3494130

Niu, Xing; Glavic, Boris; Liu, Ziyu; Li, Pengyuan; Gawlick, Dieter; Krishnaswamy, Vasudha; Liu, Zhen Hua; Porobic, Danica (November 2021, Proceedings of the VLDB Endowment)

Database systems use static analysis to determine upfront which data is needed for answering a query and use indexes and other physical design techniques to speed up access to that data. However, for important classes of queries, e.g., HAVING and top-k queries, it is impossible to determine upfront what data is relevant. To overcome this limitation, we develop provenance-based data skipping (PBDS), a novel approach that generates provenance sketches to concisely encode what data is relevant for a query. Once a provenance sketch has been captured, it is used to speed up subsequent queries. PBDS can exploit physical design artifacts such as indexes and zone maps.
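
The idea of a provenance sketch in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the range boundaries, the `price` attribute, and all function names are hypothetical, and the sketch assumes a top-k query over non-negative values with a range partition on the ordering attribute.

```python
from bisect import bisect_right

# Hypothetical range-partition boundaries on "price" (assumed non-negative):
# fragment 0 = [0,100), 1 = [100,200), 2 = [200,300), 3 = [300, inf)
BOUNDS = [0, 100, 200, 300]

def fragment_of(value, bounds=BOUNDS):
    """Return the index of the range fragment that contains value."""
    return bisect_right(bounds, value) - 1

def top_k(rows, k):
    """The query: the k rows with the highest price."""
    return sorted(rows, key=lambda r: r["price"], reverse=True)[:k]

def capture_sketch(rows, k):
    """Record which fragments contain rows that contribute to the answer."""
    return {fragment_of(r["price"]) for r in top_k(rows, k)}

def top_k_with_sketch(rows, k, sketch):
    """Re-run the query, skipping every row outside the sketch's fragments."""
    relevant = [r for r in rows if fragment_of(r["price"]) in sketch]
    return top_k(relevant, k)

prices = [20, 350, 120, 310, 90, 250, 15]
rows = [{"id": i, "price": p} for i, p in enumerate(prices)]

sketch = capture_sketch(rows, 2)
answer = top_k_with_sketch(rows, 2, sketch)
```

Here the sketch contains a single fragment, so the second run of the query only touches rows in that fragment and still returns the same answer as the full scan. In a database system, the filter on fragments would be a range predicate that an index or zone map can evaluate.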
Full Text Available
Using Reenactment to Retroactively Capture Provenance for Transactions

https://doi.org/10.1109/TKDE.2017.2769056

Arab, Bahareh Sadat; Gawlick, Dieter; Krishnaswamy, Vasudha; Radhakrishnan, Venkatesh; Glavic, Boris (November 2017, IEEE Transactions on Knowledge and Data Engineering)

Database provenance explains how results are derived by queries. However, many use cases such as auditing and debugging of transactions require understanding of how the current state of a database was derived by a transactional history. We present MV-semirings, a provenance model for queries and transactional histories that supports two common multi-version concurrency control protocols: snapshot isolation (SI) and read committed snapshot isolation (RC-SI). Furthermore, we introduce an approach for retroactively capturing such provenance using reenactment, a novel technique for replaying a transactional history with provenance capture. Reenactment exploits the time travel and audit logging capabilities of modern DBMS to replay parts of a transactional history using queries. Importantly, our technique requires no changes to the transactional workload or underlying DBMS and results in only moderate runtime overhead for transactions. We have implemented our approach on top of a commercial DBMS and our experiments confirm that by applying novel optimizations we can efficiently capture provenance for complex transactions over large data sets.
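
As a rough illustration of reenactment, the following Python sketch replays a history of updates as pure functions over a snapshot, without modifying the stored data, and records which statements touched each row. It is a simplification under stated assumptions: the table, the two updates, and the `reenact` helper are hypothetical, and the sketch covers a single transaction only, ignoring concurrency and the MV-semiring model.

```python
# Hypothetical snapshot of table R as of the transaction's start
# (in a DBMS this would come from time travel).
snapshot = [
    {"a": 3, "b": 10},
    {"a": 7, "b": 20},
    {"a": 9, "b": 30},
]

# Hypothetical audit-log entries, each as (condition, assignment):
#   UPDATE R SET b = b + 1 WHERE a > 5;
#   UPDATE R SET b = b * 2 WHERE b > 25;
history = [
    (lambda r: r["a"] > 5, lambda r: {**r, "b": r["b"] + 1}),
    (lambda r: r["b"] > 25, lambda r: {**r, "b": r["b"] * 2}),
]

def reenact(snapshot, history):
    """Replay the updates as queries over the snapshot.

    Each update becomes a map over the current intermediate result,
    the analogue of SELECT ... CASE WHEN cond THEN new ELSE old END.
    Every row carries the list of statement numbers that modified it.
    """
    annotated = [(dict(row), []) for row in snapshot]
    for stmt_no, (cond, assign) in enumerate(history):
        annotated = [
            (assign(row), prov + [stmt_no]) if cond(row) else (row, prov)
            for row, prov in annotated
        ]
    return annotated

result = reenact(snapshot, history)
```

The third row ends with `b = 62` and the annotation `[0, 1]`, showing that both updates contributed to its final value, while the snapshot itself is left unchanged.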
Full Text Available
Heuristic and Cost-based Optimization for Diverse Provenance Tasks

https://doi.org/10.1109/TKDE.2018.2827074

Niu, Xing; Kapoor, Raghav; Glavic, Boris; Gawlick, Dieter; Liu, Zhen Hua; Krishnaswamy, Vasudha; Radhakrishnan, Venkatesh (April 2018, IEEE Transactions on Knowledge and Data Engineering)

A well-established technique for capturing database provenance as annotations on data is to instrument queries to propagate such annotations. However, even sophisticated query optimizers often fail to produce efficient execution plans for instrumented queries. We develop provenance-aware optimization techniques to address this problem. Specifically, we study algebraic equivalences targeted at instrumented queries and alternative ways of instrumenting queries for provenance capture. Furthermore, we present an extensible heuristic and cost-based optimization framework utilizing these optimizations. Our experiments confirm that these optimizations are highly effective, improving performance by several orders of magnitude for diverse provenance tasks.
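
To make "instrumenting queries to propagate annotations" concrete, here is a minimal Python sketch in which each relational operator is rewritten to carry a provenance annotation alongside every row. It assumes a simple lineage-style model (the set of contributing input tuples); the operator names and sample tables are hypothetical, and the optimizations studied in the paper are not shown.

```python
def annotate(table_name, rows):
    """Tag each input row with a singleton annotation (table, position)."""
    return [(row, frozenset({(table_name, i)})) for i, row in enumerate(rows)]

def select(rel, pred):
    """Selection keeps annotations unchanged."""
    return [(row, prov) for row, prov in rel if pred(row)]

def join(left, right, on):
    """Join combines the annotations of both matching inputs."""
    return [
        ({**l, **r}, lp | rp)
        for l, lp in left
        for r, rp in right
        if l[on] == r[on]
    ]

def project(rel, attrs):
    """Duplicate-eliminating projection merges annotations of equal rows."""
    out = {}
    for row, prov in rel:
        key = tuple(row[a] for a in attrs)
        out[key] = out.get(key, frozenset()) | prov
    return [(dict(zip(attrs, key)), prov) for key, prov in out.items()]

customers = annotate("customers", [{"cid": 1, "name": "Ann"},
                                   {"cid": 2, "name": "Bob"}])
orders = annotate("orders", [{"cid": 1, "total": 50},
                             {"cid": 1, "total": 70},
                             {"cid": 2, "total": 5}])

# Names of customers with at least one order above 40.
big_orders = select(orders, lambda r: r["total"] > 40)
result = project(join(customers, big_orders, "cid"), ["name"])
```

The single result row for Ann is annotated with her customer tuple and both qualifying orders. Even in this toy version the instrumented join and projection do extra work per row, which hints at why instrumented queries can be hard for an optimizer to plan well.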
Full Text Available
SchemaDrill: Interactive Semi-Structured Schema Design

https://doi.org/10.1145/3209900.3209908

Spoth, William; Xie, Ting; Kennedy, Oliver; Yang, Ying; Hammerschmidt, Beda; Liu, Zhen Hua; Gawlick, Dieter (January 2018, Proceedings of the Workshop on Human-In-the-Loop Data Analytics)

Ad-hoc data models like JSON make it easy to evolve schemas and to multiplex different data-types into a single stream. This flexibility makes JSON great for generating data, but also makes it much harder to query, ingest into a database, and index. In this paper, we explore the first step of JSON data loading: schema design. Specifically, we consider the challenge of designing schemas for existing JSON datasets as an interactive problem. We present SchemaDrill, a roll-up/drill-down style interface for exploring collections of JSON records. SchemaDrill helps users to visualize the collection, identify relevant fragments, and map it down into one or more flat, relational schemas. We describe and evaluate two key components of SchemaDrill: (1) A summary schema representation that significantly reduces the complexity of JSON schemas without a meaningful reduction in information content, and (2) A collection of schema visualizations that help users to qualitatively survey variability amongst different schemas in the collection.
Full Text Available
